from IPython.display import Image
Image("covid19.png")
Coronaviruses are a large family of viruses that may cause illness in animals or humans. In humans, several coronaviruses are known to cause respiratory infections ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). COVID-19 is the infectious disease caused by the most recently discovered coronavirus; this virus and the disease it causes were unknown before the outbreak began in Wuhan, China, in December 2019.
Goal: To analyse and track the coronavirus outbreak and stay aware of the actual statistics affecting the areas we live in and the world as a whole. Through this project we aim to perform an analysis of the available data on COVID-19, using classifiers, unsupervised learning and forecasting to arrive at the objectives below.
1) Through our EDA, predict the survival rate of individuals in the USA and worldwide.
2) Build a model capturing the coefficients (factors) leading to death or even infection.
3) Predict the shape of the curve (when it will flatten and when the outbreak will end).
4) Implement an unsupervised learning model (K-means clustering) to identify underlying groups or categories in our data and gain possible insights.
5) Find the forecasting model that best predicts future trends.
Intended audience: researchers, statisticians, hospitals and the general public.
The dataset has been collated from the Johns Hopkins GitHub repository; we use two levels of CSV files for exploratory data analysis and statistical modelling of the coronavirus outbreak data. It has over 18,000 records with 23 columns, covering January to April 2020. The columns provide total case counts as well as detailed patient-level information.
from IPython.core.display import HTML
HTML('''<div class="flourish-embed flourish-cards" data-src="visualisation/1810417" data-url="https://flo.uri.sh/visualisation/1810417/embed"><script src="https://public.flourish.studio/resources/embed.js"></script></div>''')
#pip install plotly
#pip install "notebook>=5.3" "ipywidgets>=7.2"
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import datetime as dt
from datetime import timedelta
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score,silhouette_samples
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error,r2_score
import statsmodels.api as sm
from statsmodels.tsa.api import Holt,SimpleExpSmoothing,ExponentialSmoothing
from sklearn.preprocessing import PolynomialFeatures
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
std=StandardScaler()
#pd.set_option('display.float_format', lambda x: '%.6f' % x)
# color pallette
cnf, dth, rec, act = '#393e46', '#ff2e63', '#21bf73', '#fe9801'
import matplotlib.dates as mdates
import plotly.express as px
from datetime import date, timedelta
import plotly.offline as py
from sklearn.preprocessing import MinMaxScaler
covid = pd.read_csv(r"C:/Users/ancna/OneDrive/Desktop/STATS_ML/covid_25thApr.csv")
covid.shape
covid.head(10)
covid['Date'] = pd.to_datetime(covid['Date'])  # parse the Date column (a bare parse_dates=['Date'] line does nothing outside read_csv)
covid['Active'] = covid['Confirmed'] - covid['Deaths'] - covid['Recovered']
covid.head()
covid.tail(10)
covid[['States']] = covid[['States']].fillna('')
covid[['Confirmed', 'Deaths', 'Recovered', 'Active']] = covid[['Confirmed', 'Deaths', 'Recovered', 'Active']].fillna(0)
grouped = covid.groupby('Date')[['Confirmed', 'Deaths']].sum().reset_index()
grouped['Date'] =pd.to_datetime(grouped.Date)
grouped.sort_values(by=['Date'], inplace=True, ascending=True)
fig = px.line(grouped, x="Date", y="Confirmed", title="Worldwide Confirmed Cases Over Time")
fig.show()
Here we can see that the exponential growth curve increases slowly in the beginning, but the gains accelerate rapidly after March 12th 2020. The slope of the line in the latest time frame is very steep, which makes the situation even worse. The pandemic is still at its peak as per the data gathered.
covid['States'] = covid['States'].fillna('')
temp = covid[[col for col in covid.columns if col != 'States']]
latest = temp[temp['Date'] == max(temp['Date'])].reset_index()
latest_grouped = latest.groupby('Country')[['Confirmed', 'Deaths', 'Active']].sum().reset_index()
fig = px.choropleth(latest_grouped, locations="Country",
locationmode='country names', color="Confirmed",
hover_name="Country", range_color=[5000,10000],
color_continuous_scale="greens",
title='Countries with Confirmed Cases')
fig.show()
In the above map, we can see that the USA, European countries such as Italy, China and Iran have very high numbers of confirmed cases, with a high chance of further rises.
fig = px.bar(latest_grouped.sort_values('Confirmed', ascending=False)[:20][::-1],
x='Confirmed', y='Country',
title='Confirmed Cases Worldwide', text='Confirmed', height=1000, orientation='h', color_discrete_sequence=['#FF8007'])
fig.show()
Looking at the numbers, we can see the US is the most affected, with a total count of 461,437, followed by Spain and Italy. Among the top 20, Sweden is the least affected with 9,141 confirmed cases.
usa = covid[covid['Country'] == "US"].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
usa['Date'] = pd.to_datetime(usa.Date)
usa.sort_values(by=['Date'], inplace=True, ascending=True)
usa_latest = usa[usa['Date'] == max(usa['Date'])]
usa_latest = usa_latest.groupby('States')[['Confirmed', 'Deaths', 'Active']].max().reset_index()
fig = px.bar(usa_latest.sort_values('Confirmed', ascending=False)[:10][::-1],
x='Confirmed', y='States', color_discrete_sequence=['#9678B6'],
title='Confirmed Cases in USA', text='Confirmed', orientation='h')
fig.show()
Among the American states, New York is the most affected with a total of 271,590 cases, followed by New Jersey.
fig = px.choropleth(latest_grouped, locations="Country",
locationmode='country names', color="Deaths",
hover_name="Country", range_color=[100,10000],
color_continuous_scale="oryel",
title='Countries with Reported Deaths')
fig.show()
The deaths, however, seem to come from certain parts of the world: the USA is the most significant with a total of 16,478, followed by Europe and then China.
fig = px.choropleth(latest_grouped, locations="Country",
locationmode='country names', color="Active",
hover_name="Country", range_color=[1000,10000],
color_continuous_scale="HSV",
title='Active Cases Worldwide')
fig.show()
Here, we can see the US has the highest number of active cases (419,549), followed by European countries.
usa = covid[covid['Country'] == "US"]
usa_latest = usa[usa['Date'] == max(usa['Date'])]
usa_latest = usa_latest.groupby('States')[['Confirmed', 'Deaths', 'Active', 'Recovered']].max().reset_index()
fig = px.bar(usa_latest.sort_values('Active', ascending=False)[:10][::-1],
x='Active', y='States', color_discrete_sequence=['#29CDCF'],
title='Active Cases in USA', text='Active', orientation='h')
fig.show()
The active cases in the USA are highest in New York, with a total of 154,712, followed by New Jersey.
formated_gdf = covid.groupby(['Date', 'Country'])[['Confirmed', 'Deaths']].max()
formated_gdf = formated_gdf.reset_index()
formated_gdf['Date'] = pd.to_datetime(formated_gdf['Date'])
formated_gdf['Date'] = formated_gdf['Date'].dt.strftime('%m/%d/%Y')
formated_gdf['size'] = formated_gdf['Confirmed'].pow(0.3)
fig = px.scatter_geo(formated_gdf, locations="Country", locationmode='country names',
color="Confirmed", size='size', hover_name="Country",
range_color= [0, 1500],
projection="natural earth", animation_frame="Date",
title='COVID-19: Spread Over Time', color_continuous_scale="portland")
fig.show()
temp = covid.groupby('Date')[['Recovered', 'Deaths', 'Active']].sum().reset_index()
temp = temp.melt(id_vars="Date", value_vars=['Recovered', 'Deaths', 'Active'],
var_name='case', value_name='count')
temp['Date'] =pd.to_datetime(temp.Date)
temp.sort_values(by=['Date'], inplace=True, ascending=True)
fig = px.area(temp, x="Date", y="count", color='case',
title='Cases over time: Area Plot', color_discrete_sequence = ['cyan', 'red', 'orange'])
fig.show()
Active cases rose drastically after March 24th 2020 and the trend may continue upward. We can also see a slower increase in deaths and recovered cases.
fig = px.bar(covid, x="Date", y="Deaths", color='Country', height=600,
title='Deaths over Date', color_discrete_sequence = px.colors.cyclical.Edge)
fig.show()
We can see that as of April 2020, Italy has the highest death count at 25,969, followed by Spain and the US.
fig = px.bar(covid, x="Date", y="Recovered", color='Country', height=600,
title='Recovered Cases', color_discrete_sequence = px.colors.cyclical.IceFire)
fig.show()
Germany has the highest recovered count at 109,800, followed by the US with 99,079 and Spain with 92,355.
# Reading another csv file including detailed level data for further analysis:
data = pd.read_csv(r"C:/Users/ancna/OneDrive/Desktop/STATS_ML/Covid_CompleteList_New.csv")
data.shape
data.head()
data.shape
data.isnull().sum() # checking for null values
data.dtypes #fetching the datatypes of the dataframe.
data['Hosp Visit Date'] = pd.to_datetime(data['Hosp Visit Date'])
data["agegrp"].unique()
data["IsDeath"].unique()
sns.countplot(data=data, x='Gender')
The bar graph suggests that males are more prone to infection than females.
# A Function To Plot Pie Plot using Plotly
import plotly.graph_objs as go
def pie_plot(cnt_srs, colors, title):
    labels = cnt_srs.index
    values = cnt_srs.values
    trace = go.Pie(labels=labels,
                   values=values,
                   title=title,
                   hoverinfo='percent+value',
                   textinfo='percent',
                   textposition='inside',
                   hole=0.7,
                   showlegend=True,
                   marker=dict(colors=colors,
                               line=dict(color='#000000', width=2)))
    return trace
dead = data[data.IsDeath == 1].copy()  # .copy() avoids SettingWithCopyWarning on the sort below
dead.sort_values(by=['agegrp'], inplace=True, ascending=True)
#dead.head()
fig=py.iplot([pie_plot(dead['agegrp'].value_counts().sort_values(ascending=False).head(10), ['seaborn'], 'Age of Dead person')])
From the pie plot, the 61-70 age group is at the higher end of the risk with 31% of total death cases, followed by 81-90; the 31-50 range is lowest with about 5% of cases.
custom_palette = sns.color_palette("Reds")[3:4] + sns.color_palette("Blues")[2:5]
plt.figure(figsize=(12, 8))
sns.countplot(x = "agegrp",hue="Gender",palette=custom_palette,data=dead)
The bar graph shows total deaths by gender for the various age groups. Deaths among males are clearly higher than among females in every age group, and the 61-70 age group has the maximum number of deaths for both.
## The below csv files have travel history and infected cases in the areas of China, Korea and Mongolia.
#Reading Data
df_patient = pd.read_csv("C:/Users/ancna/OneDrive/Desktop/STATS_ML/patient.csv")
df_route = pd.read_csv("C:/Users/ancna/OneDrive/Desktop/STATS_ML/route.csv")
df_patient.shape
df_patient.isna().sum()
# Preprocessing
df_patient['birth_year'] = df_patient.birth_year.fillna(0.0).astype(int)
df_patient['birth_year'] = df_patient['birth_year'].map(lambda val: val if val > 0 else np.nan)
df_patient.confirmed_date = pd.to_datetime(df_patient.confirmed_date)
daily_count = df_patient.groupby(df_patient.confirmed_date).id.count()
accumulated_count = daily_count.cumsum()
Confirmed Cases
data = daily_count.resample('D').first().fillna(0).cumsum()
data = data[20:]
x = np.arange(len(data)).reshape(-1, 1)
y = data.values
# The model behind predicted_count was missing here; as a stand-in, a degree-2
# polynomial fit on the day index is used to extrapolate the trend.
from sklearn.pipeline import make_pipeline
predicted_count = pd.Series(make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(x, y).predict(x), index=data.index)
accumulated_count.plot()
predicted_count.plot()
plt.title('Prediction of Accumulated Confirmed Count of China, Korea and Mongolia')
plt.legend(['current confirmed count', 'predicted confirmed count'])
plt.show()
The trend analysis shows the rise of confirmed cases from 22nd February 2020, and the predicted confirmed count shows a further rise in these areas. This is as per the available data; a further rise is possible if the necessary precautions and restrictions are not applied in the respective countries.
df = pd.read_csv(r"C:/Users/ancna/OneDrive/Desktop/STATS_ML/Covid_CompleteList.csv")
df.columns
#Remove columns which are not used in analysis
df.drop(['link','source','case_in_country','id','If_onset_approximated','summary','exposure_end'],axis=1,inplace=True)
df.columns
df['death'].isnull().sum()
According to the dataset, 0 means survival while a death is recorded as a date; hence we fill NAs with 0 and replace dates with 1.
df['death'].fillna(0, inplace=True)
#For the purpose of data cleaning in death column we first convert it into string
df['death'] = df['death'].astype(str)
df['death'].dtypes
df['death'] = df['death'].replace(to_replace ='^2/.', value = '1', regex = True)
df['death'] = df['death'].replace(to_replace ='^[0-9]*/.', value = '1', regex = True)
df['death'] = df['death'].replace(to_replace ='^[0-9]+.', value = '1', regex = True)
#Converting back to numeric (to_numeric returns a new Series, so assign it back)
df['death'] = pd.to_numeric(df['death'])
df['death'] = df['death'].astype(int)
#Defining the function count plot
def count_plot(dataframe, column_name, title =None, hue = None):
base_color = sns.color_palette()[0]
sns.countplot(data = dataframe, x = column_name, hue=hue)
plt.title(title)
#Visualizing survived and death column
count_plot(df, "death", "overall life and death comparison")
#Defining a simple bar plot function
def simple_bar_plot(data, title=None):
    data.plot(kind="bar", title=title)
simple_bar_plot(df.groupby('country')['death'].count(), title = "Distribution of cases by country")
Now, looking at the data, we intend to add a Difference column holding the number of days a patient takes to visit a hospital after being exposed to or infected by COVID-19.
df['exposure_start'].dtypes
#Converting it into proper date format first
df['exposure_start'] = pd.to_datetime(df['exposure_start'])
#Converting it into proper date format first
df['hosp_visit_date'] = pd.to_datetime(df['hosp_visit_date'])
#Even this so that we can merge symptom onset and exposure start to one
df['symptom_onset'] = pd.to_datetime(df['symptom_onset'])
#Placing null values of symptom_onset to exposure_start where null and then deleting the 'symptom_onset' column
df.exposure_start.fillna(df.symptom_onset, inplace=True)
del df['symptom_onset']
#Creating Difference column
df['Difference'] = df['hosp_visit_date'].sub(df['exposure_start'], axis=0)
#Converting Difference column to number column
df['Difference'] = df['Difference'] / np.timedelta64(1, 'D')
#Converting to numeric (assign the result back; a bare pd.to_numeric call does nothing)
df['Difference'] = pd.to_numeric(df['Difference'])
df['Difference'].isnull().sum()
#Dropping null values from this column
df = df[pd.notnull(df['Difference'])]
df['Difference'].isnull().sum()
#Converting from float to integer
df['Difference'] = df['Difference'].astype(int)
#Cleaning and processing of symptom column
df['symptom'].dtypes
df['symptom'].isnull().sum()
#Replacing null by unknown
df['symptom'].fillna('unknown', inplace=True)
df['symptom'].isnull().sum()
#Cleaning of symptom column using Regex
df['symptom'] = df['symptom'].replace(to_replace ='feaver', value = 'fever', regex = True)
df['symptom'] = df['symptom'].replace(to_replace ='^f.*$', value = 'fever', regex = True)  # the escaped '\$' in the original anchor meant the pattern never matched
df['symptom'].value_counts()
#Since there are a lot of unknown values, first removing them and then visualizing the top symptoms of COVID-19
df1 = df[df.symptom != 'unknown']
#df1['symptom'].value_counts()
simple_bar_plot(df1["symptom"].value_counts(sort=True)[:20],
title = "Top 20 symptoms of corona virus infection")
According to our dataset, fever, cough and shortness of breath are the main symptoms of COVID-19 infection.
df['gender'].isnull().sum()
df = df[pd.notnull(df['gender'])]
df['gender'].isnull().sum()
df['age'].isnull().sum()
df = df[pd.notnull(df['age'])]
df['age'].isnull().sum()
df['age'].astype(int)
We decided to use the 'age', 'Difference' and 'gender' columns to predict the chances of survival or death for an infected person. We chose the Random Forest model because it is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. It is also a multiclass classifier, so we use the predict_proba method, which efficiently checks the probability of each class.
# Encoding male and female as numeric values to put in the model (.loc avoids chained-assignment warnings)
df.loc[df.gender == 'male', 'gender'] = 1
df.loc[df.gender == 'female', 'gender'] = 2
#Defining X and Y
X = df[['age', 'Difference', 'gender']]
Y = df['death']
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=100)
Evaluating a classifier is more complicated than evaluating a regressor (t-test, F-test, etc.). One way to evaluate a model is to use cross-validation: the training data is broken up into different subsets, called K folds. For example, when K=3, the training data is broken into 3 subsets, and the model is trained and tested 3 times (K times). In each iteration, 1 subset is chosen as the test set, while the other 2 are used to train the model.
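The 3-fold mechanics described above can be sketched with scikit-learn's KFold on toy indices:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 9 samples split into K=3 folds (no shuffling, so folds are contiguous).
X_demo = np.arange(9).reshape(-1, 1)

kf = KFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo)):
    # In each iteration one fold is held out for testing,
    # the other two are used for training.
    print(f"Fold {fold}: train={train_idx.tolist()}, test={test_idx.tolist()}")
# Fold 0: train=[3, 4, 5, 6, 7, 8], test=[0, 1, 2]
# Fold 1: train=[0, 1, 2, 6, 7, 8], test=[3, 4, 5]
# Fold 2: train=[0, 1, 2, 3, 4, 5], test=[6, 7, 8]
```

cross_val_predict and cross_val_score below perform exactly this rotation internally with cv=3.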
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
forest_scores=cross_val_predict(rfc,X_Train,Y_Train,cv=3,method='predict_proba')
forest_scores[0]
#The predict_proba method returns a probability for each class
The 0.32 represents a 32% probability that the instance belongs to class 1 (death), and the 0.68 a 68% probability of class 0 (survival). Since class 0 has the higher probability, the model predicts 0.
Y.iloc[0:1]
When we check manually it gives us 0 as well; hence our model predicted this instance correctly.
#For checking accuracy in rfc
from sklearn.model_selection import cross_val_score
cross_val_score(rfc,X_Train,Y_Train,cv=3,scoring='accuracy')
We can see the accuracy is pretty good in all three folds of validation!
But the forest scores are probabilities, not scores; a simple solution is to use the positive class's (1s) probability as the score. We do the same at a later point below to find the roc_auc score, precision and recall.
rfc.fit(X_Train,Y_Train)
Y_Pred = rfc.predict(X_Test)
Y_Pred
In the case of a skewed dataset, some classes are much more frequent than others: for example, there may be many more survivals (0s) than deaths (1s). In such cases a confusion matrix is better than cross-validation accuracy.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_Test, Y_Pred)
cm
from sklearn.metrics import roc_auc_score
roc_auc_score(Y_Train,forest_scores[:,1])
Our roc_auc score is pretty good!
Y_Pred1=cross_val_predict(rfc,X_Train,Y_Train,cv=3) #Finding y prediction through cross validation
Precision is the accuracy of the positive predictions of a certain class. In our analysis, we calculate the precision of our classifier in identifying both survivals (0s) and deaths (1s): for example, of the cases predicted to be 1, how many were actually 1.
Recall, also called sensitivity or the true positive rate (TPR), is the ratio of positive instances that are correctly detected by the classifier.
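A small sketch with hypothetical confusion-matrix counts (not from our model) shows how the two metrics are computed:

```python
# Hypothetical counts (not from our data):
tp, fp, fn = 30, 10, 45  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of those predicted 1 (death), how many really were 1
recall = tp / (tp + fn)     # of the actual 1s, how many the classifier caught

print(precision)  # 0.75
print(recall)     # 0.4
```

sklearn's precision_score and recall_score below compute the same ratios from Y_Train and the cross-validated predictions.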
#Finding precision score
from sklearn.metrics import precision_score
precision_score(Y_Train ,Y_Pred1)
#Finding recall score
from sklearn.metrics import recall_score
recall_score(Y_Train ,Y_Pred1)
#Find precision recalls threshold and plotting them
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(Y_Train, Y_Pred1)
plt.plot(recalls,precisions)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0 ,1])
plt.xlim([0 ,1])
plt.show()
#From the above plot we can see that recall can be higher when precision is less than 40%
We use a measure that combines precision and recall into a single number called the F1 score, the harmonic mean of the two. A harmonic mean gives more weight to lower values, so a classifier can only get a high F1 score if both values are high.
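A quick numeric illustration (hypothetical precision/recall values) of why the harmonic mean punishes imbalance:

```python
# The harmonic mean is dragged toward the lower of the two values:
precision, recall = 0.9, 0.1  # hypothetical, deliberately imbalanced

f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(round(f1, 2))     # 0.18 -- close to the weaker metric
print(arithmetic_mean)  # 0.5  -- would hide the weak recall
```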
from sklearn.metrics import f1_score
f1_score(Y_Train ,Y_Pred1)
We see our F1 score is 42%, balancing both precision and recall.
threshold_40=thresholds[recalls[:-1]>=.4].max()
threshold_40
#We find the threshold for recall since, in our model, recall is more important than precision, and the curve falls at 40%
%matplotlib inline
from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (20, 12)
plt.style.use('ggplot')
f2 = df['age'].values
f1 = df['Difference'].values
X = np.array(list(zip(f1, f2)))
plt.scatter(f2, f1, c='black', s=9)
# Euclidean Distance Caculator
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)
# Number of clusters
k = 3
# X coordinates of random centroids
C_x = np.random.randint(0, np.max(X)-20, size=k)
# Y coordinates of random centroids
C_y = np.random.randint(0, np.max(X)-20, size=k)
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print(C)
# Plotting along with the Centroids
plt.scatter(f1, f2, c='#050505', s=7)
plt.scatter(C_x, C_y, marker='*', s=200, c='g')
The green stars depict the random positions of the centroids before the start of K-means clustering.
# To store the value of centroids when it updates
C_old = np.zeros(C.shape)
# Cluster labels (0, 1, 2)
clusters = np.zeros(len(X))
# Error func. - Distance between new centroids and old centroids
error = dist(C, C_old, None)
# Loop will run till the error becomes zero
while error != 0:
    # Assigning each value to its closest cluster
    for i in range(len(X)):
        distances = dist(X[i], C)
        cluster = np.argmin(distances)
        clusters[i] = cluster
    # Storing the old centroid values
    C_old = deepcopy(C)
    # Finding the new centroids by taking the average value
    for i in range(k):
        points = [X[j] for j in range(len(X)) if clusters[j] == i]
        C[i] = np.mean(points, axis=0)
    error = dist(C, C_old, None)
colors = ['r', 'g', 'b', 'y', 'c', 'm']
fig, ax = plt.subplots()
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='#050505')
This method takes the trend of the dataset into account. By trend, we mean the increasing or decreasing nature of the series: for example, if the number of bookings in a hotel increases every year, the series shows an increasing trend. The forecast function in this method is a function of level and trend.
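For reference, Holt's linear trend method (the Holt class used below) combines a level and a trend component; in standard notation:

```latex
\ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1})      % level update
b_t = \beta (\ell_t - \ell_{t-1}) + (1 - \beta)\, b_{t-1}     % trend update
\hat{y}_{t+h} = \ell_t + h\, b_t                              % h-step-ahead forecast
```

Here alpha and beta correspond to the smoothing_level and smoothing_slope arguments passed to Holt(...).fit() below.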
This is one more dataset, containing all confirmed cases worldwide, which is well suited for forecasting.
df9 = pd.read_csv(r"C:/Users/ancna/OneDrive/Desktop/STATS_ML/covid_19_data.csv")
Since New York is currently among the worst-affected places in the world and is highly diverse, we will forecast the infection rate in New York. We first try Holt's Linear Trend model because we see an increasing trend in New York as of now.
US = df9[df9['Province/State'] == "New York"]
df9.columns
#We are only taking the univariate time series data of date and confirmed cases
df10 = US[['ObservationDate','Confirmed']]
df10 = df10.set_index('ObservationDate')
prophet= pd.DataFrame(df10)
df10.head()
df10.tail() # Confirming it is a time series data
df10.dtypes
#Preparing train and test data
model_train=df10.iloc[:int(df10.shape[0]*0.90)]
valid=df10.iloc[int(df10.shape[0]*0.90):]
from statsmodels.tsa.api import Holt,SimpleExpSmoothing,ExponentialSmoothing
model_scores=[]
holt=Holt(np.asarray(model_train["Confirmed"])).fit(smoothing_level=0.3, smoothing_slope=0.4)
y_pred=valid.copy()
y_pred["Holt"]=holt.forecast(len(valid))
model_scores.append(np.sqrt(mean_squared_error(y_pred["Confirmed"],y_pred["Holt"])))
print("Root Mean Square Error Holt's Linear Model: ",np.sqrt(mean_squared_error(y_pred["Confirmed"],y_pred["Holt"])))
plt.figure(figsize=(15,5))
plt.plot(model_train.Confirmed,label="Train Set",marker='o')
valid.Confirmed.plot(label="Validation Set",marker='*')
y_pred.Holt.plot(label="Holt's Linear Model Predicted Set",marker='^')
plt.ylabel("Confirmed Cases")
plt.xlabel("Date Time")
plt.title("Confirmed Holt's Linear Model Prediction")
plt.xticks(rotation=90)
plt.legend()
We can see our model predicts pretty well. However, let us try an ARIMA model, considering the series might not have a trend; but for that we first have to convert the time series into stationary data.
ARIMA is a very popular technique for time series modelling. It describes the correlation between data points and takes the differences of the values into account. The series must be stationary, meaning the mean and variance should not vary with time; a series can be made stationary by a log transformation or by differencing. The input must be a univariate series, since ARIMA uses past values to predict future values.
ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches, and they are complementary: while exponential smoothing models are based on a description of trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. We use ARIMA here considering that COVID-19 data has neither seasonality nor a stable trend.
We convert the data to stationary form, which makes prediction of future values easier; a stationary time series is one whose properties do not depend on the time at which the series is observed.
confirm_cs = prophet.cumsum()
arima_data = confirm_cs.reset_index()
arima_data.columns = ['ObservationDate','Confirmed']
We used an order of (0,1,0) because it is a special case of ARIMA: the random walk model. We use a random walk because COVID-19 data has stochastic trends, and a random walk often provides a good fit to such data. The forecasts from a random walk model are equal to the last observation, as future movements are unpredictable and equally likely to be up or down; thus, the random walk model underpins naïve forecasts.
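To make the random-walk forecast concrete, a minimal sketch on hypothetical numbers (not our dataset):

```python
import numpy as np

# Hypothetical cumulative case counts standing in for our series:
series = np.array([10, 25, 60, 130, 240])

# Under a plain random-walk model the h-step point forecast is
# just the last observation.
last = series[-1]
naive_forecast = np.repeat(last, 3)  # next 3 periods
print(naive_forecast.tolist())  # [240, 240, 240]

# With drift (the constant fitted via trend='c' below), the drift is
# estimated as the mean first difference, and forecasts grow linearly.
drift = np.diff(series).mean()  # (240 - 10) / 4 = 57.5
drift_forecast = last + drift * np.arange(1, 4)
print(drift_forecast.tolist())  # [297.5, 355.0, 412.5]
```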
model = ARIMA(arima_data['Confirmed'].values, order=(0,1,0))
fit_model = model.fit(trend='c', full_output=True, disp=True)
fit_model.summary()
Akaike's Information Criterion (AIC): it should be small. The AIC is a widely used measure of a statistical model; it quantifies, in a single statistic, 1) the goodness of fit and 2) the simplicity/parsimony of the model. When comparing two models, the one with the lower AIC is generally "better".
The Bayesian Information Criterion (BIC) should likewise be as small as possible; it addresses overfitting by introducing a penalty term for the number of parameters in the model, and this penalty is larger in BIC than in AIC.
Good models are obtained by minimizing the AIC, AICc or BIC; we use auto_arima below to search for the order that minimizes these scores.
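As a toy illustration of the fit-versus-parsimony trade-off (hypothetical log-likelihoods, not from our fit), using AIC = 2k - 2 ln L:

```python
# Hypothetical log-likelihoods and parameter counts for two candidate models:
models = {"ARIMA(0,1,0)": (-1200.0, 1), "ARIMA(2,1,2)": (-1195.0, 5)}

# AIC = 2k - 2*log-likelihood; lower is better.
aics = {name: 2 * k - 2 * loglik for name, (loglik, k) in models.items()}
print(aics)  # {'ARIMA(0,1,0)': 2402.0, 'ARIMA(2,1,2)': 2400.0}

# The richer model wins here only because its likelihood gain (10 units of
# deviance) outweighs the extra-parameter penalty (8 units).
```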
We use auto_arima to find the best values of p and q, and hence we need to install pmdarima.
#pip install pmdarima
import pmdarima
from pmdarima import auto_arima #Make series stationary, Determine d value,Create ACF and PACF plots for determining parameters,
#Determine the p and q values
import warnings
warnings.filterwarnings("ignore")
stepwise_fit = auto_arima(arima_data['Confirmed'], start_p = 1, start_q = 1, # The auto_arima helps us in Selecting appropriate values for p, d and q for the best model
max_p = 3, max_q = 3, m = 12,
start_P = 0, seasonal = True,
d = None, D = 1, trace = True,
error_action ='ignore', # we don't want to know if an order does not work
suppress_warnings = True, # we don't want convergence warnings
stepwise = True)
stepwise_fit.summary()
fit_model.plot_predict()
plt.title('Forecast vs Actual')
pd.DataFrame(fit_model.resid).plot()
Assuming we have observations up to time T, all values are known except e_{T+1}, which we replace by zero, and e_T, which we replace by the last observed residual. A forecast of y_{T+2} is obtained by replacing t with T+2, and the process continues in this manner for all future time periods; hence any number of point forecasts can be obtained.
forecast = fit_model.forecast(steps=46)
#ARIMAResults.forecast returns (forecast, stderr, conf_int); index 0 holds the point forecasts
pred_y = forecast[0].tolist()
pd.DataFrame(pred_y)
#calculate rmse
from math import sqrt
from sklearn.metrics import mean_squared_error
rms = sqrt(mean_squared_error(arima_data['Confirmed'],pred_y))
print(rms)
1) From our analysis and EDA we identified that the USA is the most affected country, trending toward larger numbers of confirmed and death cases in the current situation, followed by European countries.
2) Through the RFC model we see that, though the death count is low at the moment, it could potentially increase in future without the necessary steps; age, gender and access to medical attention are not the only attributes determining survival, and we could improve our model by maximizing recall.
3) From both the Holt's Linear Trend and ARIMA models we forecast an increasing number of cases. The Holt's Linear Trend model is better for forecasting in the COVID-19 case, since we see an increasing trend and the fluctuations are irrelevant; the same is endorsed by the RMSE values of both models.
4) From our K-means clustering we found that the 20-50 and 50-100 age groups take up to around 20 days (with most people in the 50-100 group) to fall seriously ill, as described by the difference between infection date and hospital visit date. There is also one more group which takes more than 30 days and is diverse in terms of age.
1) Johns Hopkins GitHub repository: https://github.com/CSSEGISandData/COVID-19
2) https://www.cdc.gov/coronavirus/2019-ncov/index.html
3) https://www.analyticsvidhya.com/blog/2018/08/auto-arima-time-series-modeling-python-r/
from IPython.display import Image
Image("End.png")